
storage: fsync sideload sst writes every 2MB #20449

Merged
merged 1 commit into from
Jan 8, 2018

Conversation

dt
Member

@dt dt commented Dec 4, 2017

#20352 configured rocksdb to sync every 512kb. This does the same for our sst sideload file writes.
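For context, a minimal sketch of the chunked write-and-sync approach, assuming the four-argument WriteFileSyncing signature visible in the r1 diff quoted below; the real implementation lives in pkg/util/fileutil/syncing_write.go and differs in details.

```go
package fileutil

import "os"

// WriteFileSyncing sketches the idea in this PR: write data to path in
// chunks, calling Sync after every syncEvery bytes so the kernel never
// accumulates a huge backlog of dirty pages for one large write.
func WriteFileSyncing(path string, data []byte, perm os.FileMode, syncEvery int) error {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, perm)
	if err != nil {
		return err
	}
	if syncEvery <= 0 {
		syncEvery = len(data) // no intermediate syncs
	}
	for len(data) > 0 && err == nil {
		chunk := data
		if len(chunk) > syncEvery {
			chunk = chunk[:syncEvery]
		}
		var n int
		n, err = f.Write(chunk)
		data = data[n:]
		if err == nil {
			err = f.Sync()
		}
	}
	// Close unconditionally so the fd is never leaked; only surface Close's
	// error if nothing failed earlier.
	if closeErr := f.Close(); err == nil {
		err = closeErr
	}
	return err
}
```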

@dt dt requested review from tbg, a-robinson and a team December 4, 2017 18:09
@cockroach-teamcity
Member

This change is Reviewable

@a-robinson
Contributor

🎉 Have you tested this out to make sure data gets flushed as intended? Would you like me to?


Reviewed 3 of 3 files at r1.
Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/storage/replica_proposal.go, line 341 at r1 (raw file):

		}

		if err := fileutil.WriteFileSyncing(path, sst.Data, 0600, 512<<10); err != nil {

Nit, but is there any reason for all callers to have to provide 512<<10 instead of having a named constant somewhere or just not even keeping it as an argument to WriteFileSyncing?


pkg/util/fileutil/syncing_write.go, line 22 at r1 (raw file):

)

// WriteFileSyncing is esseially ioutil.WriteFile -- writes data to a file named

esseially


pkg/util/fileutil/syncing_write.go, line 52 at r1 (raw file):

	}

	if err == nil {

Shouldn't we close the file even if there was an error? Leaking the file descriptor doesn't seem like a good idea.


Comments from Reviewable

@tbg
Member

tbg commented Dec 4, 2017

:lgtm: mod @a-robinson's comments.


Reviewed 3 of 3 files at r1.
Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


Comments from Reviewable

@dt
Member Author

dt commented Dec 4, 2017

Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/storage/replica_proposal.go, line 341 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Nit, but is there any reason for all callers to have to provide 512<<10 instead of having a named constant somewhere or just not even keeping it as an argument to WriteFileSyncing?

Done.


pkg/util/fileutil/syncing_write.go, line 22 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

esseially

Done.


pkg/util/fileutil/syncing_write.go, line 52 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Shouldn't we close the file even if there was an error? Leaking the file descriptor doesn't seem like a good idea.

Whoops, yes. I meant to close it unconditionally and overwrite err with close's err only when it was nil.
Done.
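For clarity, the unconditional-close idiom described here looks roughly like the following sketch, using a deferred close and a named return; the function name is illustrative, not the code that landed.

```go
package fileutil

import "os"

// writeAndClose illustrates the fix described above: the file is always
// closed, so the descriptor can't leak, and Close's error only overwrites
// err when err is still nil, so the first failure is the one reported.
func writeAndClose(path string, data []byte, perm os.FileMode) (err error) {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, perm)
	if err != nil {
		return err
	}
	defer func() {
		if closeErr := f.Close(); err == nil {
			err = closeErr
		}
	}()
	if _, err = f.Write(data); err != nil {
		return err
	}
	return f.Sync()
}
```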


Comments from Reviewable

@dt
Member Author

dt commented Dec 4, 2017

I haven't tested it yet -- was going to pick up your rocks config change too and try out a build with both of them, which, theoretically, should have no giant writes happening during a big restore.


Review status: 0 of 3 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


Comments from Reviewable

@a-robinson
Contributor

:lgtm: pending testing. The necessary rocksdb changes are in as of this morning. One thing to check to make sure this is working as intended is that during the restore you don't see the output of grep Dirty /proc/meminfo grow too much. In my testing with two rocksdb instances (one primary and one temp) it never grew past 10MB when things were working properly, and regularly got into the high 10s of MBs or hundreds of MBs when things weren't working properly. Unless we're writing a lot of these files in parallel, I'd expect it to stay below 10MB of dirty data.
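For anyone repeating that check, here is a small standalone Go poller, illustrative only, that does the equivalent of running grep Dirty /proc/meminfo once a second.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// Print the Dirty line from /proc/meminfo once a second (Linux only),
// roughly equivalent to running `grep Dirty /proc/meminfo` in a loop.
func main() {
	for range time.Tick(time.Second) {
		f, err := os.Open("/proc/meminfo")
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		sc := bufio.NewScanner(f)
		for sc.Scan() {
			if strings.HasPrefix(sc.Text(), "Dirty:") {
				fmt.Println(sc.Text())
			}
		}
		f.Close()
	}
}
```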


Reviewed 3 of 3 files at r2.
Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


Comments from Reviewable

@dt
Member Author

dt commented Dec 5, 2017

Good news and bad news.

2GB restore, 4 node roachprod gce cluster:

Good news: on master, polling grep Dirty /proc/meminfo every second, Dirty peaked at 1313672 kB. With this change, it peaked at 448 kB.

Bad news: that 2GB restore went from 36s to 139s. Yikes.

@a-robinson
Contributor

Ouch, that's pretty brutal. As discussed at lunch, syncing more than 512KB at a time may help. We may be able to parallelize a bit as well if that doesn't work out, but hopefully syncing a little more at a time gets us back to a more reasonable place.

It's nice that we won't have more than a gigabyte of dirty data waiting to be flushed anymore!

@dt
Member Author

dt commented Dec 5, 2017

Syncing every 4mb was more like 117s and every 8mb got it to 101s. Interestingly, even at 8mb, I still didn't observe a dirty number over ~300kb, though I was only polling once per second.

@dt
Member Author

dt commented Dec 5, 2017

one theory was that we could handle more concurrency now that we're not waiting for syncs, but, at first glance, upping the concurrent import request limit just makes it slower and brings back liveness errors :/

more digging is required.

@dt
Member Author

dt commented Dec 5, 2017

bumping importRequestLimit from 1 to 2 seems to cause a mild slow-down, and going to 3 worsens it and even causes sporadic slow heartbeat messages, so it doesn't look like turning that knob alone is a win.

maybe we want a background fsync and some syncs-in-flight limit, but eh -- that feels slightly like re-inventing what the kernel is supposed to be doing, and we do actually want ssts synced before we call the restore done, which they very much are not currently.
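Purely to illustrate the background-fsync-with-a-syncs-in-flight-limit idea mentioned above -- explicitly not something this PR does -- here is a sketch using a buffered channel as a semaphore:

```go
package fileutil

import (
	"os"
	"sync"
)

// syncLimiter sketches the "background fsync with a bounded number of syncs
// in flight" idea floated above; it is not what this PR implements.
type syncLimiter struct {
	sem chan struct{} // buffered; capacity = max syncs allowed in flight
	wg  sync.WaitGroup

	mu  sync.Mutex
	err error // first sync error observed, if any
}

func newSyncLimiter(maxInFlight int) *syncLimiter {
	return &syncLimiter{sem: make(chan struct{}, maxInFlight)}
}

// syncAsync starts f.Sync() in the background, blocking only when the
// in-flight limit has already been reached.
func (l *syncLimiter) syncAsync(f *os.File) {
	l.sem <- struct{}{} // acquire a slot
	l.wg.Add(1)
	go func() {
		defer func() { <-l.sem; l.wg.Done() }()
		if err := f.Sync(); err != nil {
			l.mu.Lock()
			if l.err == nil {
				l.err = err
			}
			l.mu.Unlock()
		}
	}()
}

// wait blocks until all outstanding syncs finish and returns the first error.
func (l *syncLimiter) wait() error {
	l.wg.Wait()
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.err
}
```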

@a-robinson
Contributor

Well that's frustrating. You and the bulk I/O team will have to decide what sort of performance hit is acceptable.

I was wondering, though, did you test with vs without fsync for restores larger than 2GB? It'd be useful to know how the difference scales -- is fsyncing always 2.5x slower, or does it just add 60 seconds to the end of a restore, or somewhere in between?

@petermattis
Collaborator

It might be a correctness bug that the sideloaded sstables were not previously being synced. An inopportune crash could remove data. I think we have to sync these files after they are written. See also https://stackoverflow.com/questions/15348431/does-close-call-fsync-on-linux.

@dt
Member Author

dt commented Dec 6, 2017

Yeah, agreed -- especially since once that restore finishes, we're happy to serve reads against those sstables, and on that 2gb restore, it looked like we still had 650mb+ reported as dirty well after RESTORE returned. So some of the "slow down" here is just that the previous restore numbers were a little misleading, since we hadn't actually written to disk after all.

Syncing every 8mb and syncing only after writing each file look about the same, but I still saw a few slow heartbeat warnings. 4mb and 2mb each look like they add a little more slowdown, presumably indicating that we're waiting for syncs when we could be getting more work done, but they get rid of the liveness complaints.

Switched it to a setting, so we can continue to play with it. RFAL.
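Roughly, making the chunk size tunable instead of a hard-coded constant looks like the sketch below; the variable name and plumbing are illustrative, and in the actual change the value comes from a cluster setting rather than a package-level var.

```go
package fileutil

// sstSyncChunkBytes stands in for the tunable described above: the number of
// bytes written between fsync calls. In the actual change the value comes
// from a cluster setting so it can be adjusted at runtime; the name and the
// plumbing here are illustrative only.
var sstSyncChunkBytes = int64(2 << 20) // 2MB, matching the final PR title

// writeSSTFile writes an sst payload using the currently configured chunk
// size, reusing the WriteFileSyncing sketch above.
func writeSSTFile(path string, data []byte) error {
	return WriteFileSyncing(path, data, 0600, int(sstSyncChunkBytes))
}
```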

@benesch
Contributor

benesch commented Dec 12, 2017

According to my read of the ext4 docs—and my understanding is that all of our local SSDs are mounted as ext4—the mount options default to commit=5s if unset, which means that all dirty data/metadata should be flushed to disk every 5s, even in the absence of fsync calls. I don't have a great intuition for how long syncing 600MB should take, but I'd think on the order of a few seconds. So I guess I don't understand why so many pages were dirty for so long in a non-syncing restore. When you say well after, @dt, do you mean a couple seconds or a couple minutes?

It is of course quite possible that I'm misreading/misunderstanding something.

@benesch
Contributor

benesch commented Dec 12, 2017

I'd also be curious to see if syncing once, right before the restore commits, would be sufficient.

In any case, these aren't complaints about the approach in this PR! Just musings.

@dt
Member Author

dt commented Dec 12, 2017

@benesch just for a couple seconds... but that is still after we'd published the table descs and have promised durability.

After switching to a cluster setting, I did a trial run with syncSize = 128mb, which should mean just one sync per sst, after writing the whole file. While that was a tiny bit faster than 4 or 2mb, it was still in the ~100s range, not the <40s range of no-sync, though I was seeing some heartbeat complaints on those runs.

@dt
Member Author

dt commented Dec 12, 2017

Oh, you meant one global sync, rather than individual files, sorry. While I guess that would get us consistency at the SQL level, since we probably don't care about those ranges until the RESTORE completes, I'm a little wary of the individual AddSSTable commands completing without having the data safely on disk? Also, the original motivation for the smaller syncs was partly to avoid the QoS issue, where the IO scheduler blocks all our small writes to chew through the big one.

@dt dt changed the title from "storage: fsync sideload sst writes every 512kb" to "storage: fsync sideload sst writes every 2MB" on Dec 19, 2017
@dt
Member Author

dt commented Dec 19, 2017

Any objections to merging this, as-is, and then investigating any improvements we want to make in followups? As-is, we're kinda lying by not syncing at all.

Contributor

@a-robinson a-robinson left a comment


No objections from me. A benchmark (or at least instructions to do the same test run you've been doing) would be nice, though.
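For reference, a sketch of the kind of microbenchmark that could cover this; the names, payload size, and chunk sizes are illustrative, and it assumes the four-argument WriteFileSyncing signature from the r1 diff above.

```go
package fileutil

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
	"testing"
)

// BenchmarkWriteFileSyncing compares a few sync chunk sizes on a 16MB payload.
func BenchmarkWriteFileSyncing(b *testing.B) {
	data := bytes.Repeat([]byte("0123456789abcdef"), 1<<20) // 16MB payload
	for _, chunk := range []int{512 << 10, 2 << 20, 8 << 20} {
		b.Run(fmt.Sprintf("sync=%d", chunk), func(b *testing.B) {
			dir, err := ioutil.TempDir("", "syncing-write-bench")
			if err != nil {
				b.Fatal(err)
			}
			defer os.RemoveAll(dir)
			b.SetBytes(int64(len(data)))
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				path := filepath.Join(dir, fmt.Sprintf("sst-%d", i))
				if err := WriteFileSyncing(path, data, 0600, chunk); err != nil {
					b.Fatal(err)
				}
				_ = os.Remove(path) // keep disk usage bounded across iterations
			}
		})
	}
}
```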

@benesch
Contributor

benesch commented Dec 27, 2017

Oh, you meant one global sync, rather than individual files, sorry. While I guess that would get us consistency at the SQL level, since we probably don't care about those ranges until the RESTORE completes, I'm a little wary of the individual AddSSTable commands completing without having the data safely on disk? Also, the original motivation for the smaller syncs was partly to avoid the QoS issue, where the IO scheduler blocks all our small writes to chew through the big one.

Yeah, one sync at the very end is definitely not viable from a raft consistency perspective. It's just shocking to me that periodic fsyncs cause such a slowdown when the data in question can be flushed to disk in a few seconds.
